David Durán Prieto
Gerardo Adrián Aguirre Vivar
Ana Jiménez Santamaría
El data set que ha sido elegido proviene de una encuesta realizada por la PSA (Philippine Statistics Authority) donde se recogen los gastos e ingresos por familia en las Islas Filipinas. Contiene más de 40000 observaciones y 60 variables, que han sido agrupadas en las siguientes categorías:
Durante varios años, identificar un modelo de clasificación socio-económico óptimo en Filipinas ha sido un tema difícil de abordar. A día de hoy, ningun modelo ha sido aceptado de forma global, y los diferentes organismos gubernamentales que existen utilizan sus propios modelos. Por ello, el presente trabajo se plantea un objetivo: diseñar un modelo que consiga abordar el problema y resolverlo de manera eficaz.
Objetivo: Predecir los ingresos de una familia filipina, basándse en los datos disponibles. Pregunta: A partir de un modelo de regresión lineal múltiple, ¿qué variables son las más adecuadas para predecir los ingresos? Target: La variable respuesta es el total de ingresos de cada familia filipina (Total.Household.Income)
El análisis de dividirá en dos fases:
La primera fase consistirá en un análisis exploratorio de los datos para entender mejor el significado y la relevancia de cada una de las variables. Se estudiarán puntos clave como el nivel de correlación entre la variable de interés y las demás. Por ello, para cada variable estudiada, se planteará:
La segunda fase consistirá en la elaboración de un modelo de regresión lineal múltiple con las variables predictoras seleccionadas.
Antes de proceder con la visualización gráfica de las variables (para tener un visión de la distribución de nuestros datos), será realizado un preprocesamiento y limpieza del conjunto de datos. Serán etiquetados como NA aquellos valores que así deban considerarse; se eliminarán ciertas variables por no presentar interés para el objetivo planteado o por no estar bien categorizadas, y por último, seran preparados los conjuntos de test/validación y de train. Este último se utilizará para entrenar el modelo de predicción, que será después evaluado con el conjunto de test/validación.
# ----- Se cargan las librerías que serán necesarias ------
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(GGally)
library(gridExtra)
library(egg)
library(VIM)
library(vcd)
library(Hmisc)
library(readr)
library(moments)
library(caret)
library(gmodels)
library(reshape)
library(ggcorrplot)
library(knitr)
library(kableExtra)
A continuación, se realizará un resumen de los estadísticos principales de las variables numéricas. Curiosamente, solo se encuentran datos faltantes en las variables categóricas, que más adelante se tratarán.
# ----- Carga de datos -----
datos<-read.csv('Family_Income_and_Expenditure.csv',stringsAsFactors = TRUE)
datos$Electricity<-as.factor(datos$Electricity)
datos_occupation <- datos
# ----- Resumen numérico de las variables -----
kable(summary(datos),caption="Summary del conjunto de datos") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| Total.Household.Income | Region | Total.Food.Expenditure | Main.Source.of.Income | Agricultural.Household.indicator | Bread.and.Cereals.Expenditure | Total.Rice.Expenditure | Meat.Expenditure | Total.Fish.and..marine.products.Expenditure | Fruit.Expenditure | Vegetables.Expenditure | Restaurant.and.hotels.Expenditure | Alcoholic.Beverages.Expenditure | Tobacco.Expenditure | Clothing..Footwear.and.Other.Wear.Expenditure | Housing.and.water.Expenditure | Imputed.House.Rental.Value | Medical.Care.Expenditure | Transportation.Expenditure | Communication.Expenditure | Education.Expenditure | Miscellaneous.Goods.and.Services.Expenditure | Special.Occasions.Expenditure | Crop.Farming.and.Gardening.expenses | Total.Income.from.Entrepreneurial.Acitivites | Household.Head.Sex | Household.Head.Age | Household.Head.Marital.Status | Household.Head.Highest.Grade.Completed | Household.Head.Job.or.Business.Indicator | Household.Head.Occupation | Household.Head.Class.of.Worker | Type.of.Household | Total.Number.of.Family.members | Members.with.age.less.than.5.year.old | Members.with.age.5…17.years.old | Total.number.of.family.members.employed | Type.of.Building.House | Type.of.Roof | Type.of.Walls | House.Floor.Area | House.Age | Number.of.bedrooms | Tenure.Status | Toilet.Facilities | Electricity | Main.Source.of.Water.Supply | Number.of.Television | Number.of.CD.VCD.DVD | Number.of.Component.Stereo.set | Number.of.Refrigerator.Freezer | Number.of.Washing.Machine | Number.of.Airconditioner | Number.of.Car..Jeep..Van | Number.of.Landline.wireless.telephones | Number.of.Cellular.phone | Number.of.Personal.Computer | Number.of.Stove.with.Oven.Gas.Range | Number.of.Motorized.Banca | Number.of.Motorcycle.Tricycle | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 11285 | IVA - CALABARZON : 4162 | Min. : 2947 | Enterpreneurial Activities:10320 | Min. :0.0000 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 1950 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Female: 9061 | Min. : 9.00 | Annulled : 11 | High School Graduate : 9628 | No Job/Business : 7536 | Farmhands and laborers : 3478 | Self-employed wihout any employee :13766 | Extended Family :12932 | Min. : 1.000 | Min. :0.0000 | Min. :0.000 | Min. :0.000 | Commercial/industrial/agricultural building: 51 | Light material (cogon,nipa,anahaw) : 5074 | Light : 8267 | Min. : 5.0 | Min. : 0.00 | Min. :0.000 | Own or owner-like possession of house and lot :29541 | Water-sealed, sewer septic tank, used exclusively by household:29162 | 0: 4536 | Own use, faucet, community water system:16093 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. : 0.000 | Min. :0.000 | Min. :0.000 | Min. :0.00000 | Min. :0.0000 | |
| 1st Qu.: 104895 | NCR : 4130 | 1st Qu.: 51017 | Other sources of Income :10836 | 1st Qu.:0.0000 | 1st Qu.: 16556 | 1st Qu.: 11020 | 1st Qu.: 3354 | 1st Qu.: 5504 | 1st Qu.: 1025 | 1st Qu.: 2873 | 1st Qu.: 1930 | 1st Qu.: 0 | 1st Qu.: 0 | 1st Qu.: 1365 | 1st Qu.: 13080 | 1st Qu.: 6000 | 1st Qu.: 300 | 1st Qu.: 2412 | 1st Qu.: 564 | 1st Qu.: 0 | 1st Qu.: 3792 | 1st Qu.: 0 | 1st Qu.: 0 | 1st Qu.: 0 | Male :32483 | 1st Qu.:41.00 | Divorced/Separated: 1425 | Elementary Graduate : 7640 | With Job/Business:34008 | Rice farmers : 2849 | Worked for private establishment :13731 | Single Family :28445 | 1st Qu.: 3.000 | 1st Qu.:0.0000 | 1st Qu.:0.000 | 1st Qu.:0.000 | Duplex : 1084 | Mixed but predominantly light materials : 846 | NOt applicable: 12 | 1st Qu.: 25.0 | 1st Qu.: 10.00 | 1st Qu.:1.000 | Own house, rent-free lot with consent of owner : 6165 | Water-sealed, sewer septic tank, shared with other household : 3694 | 1:37008 | Shared, tubed/piped deep well : 6242 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.: 1.000 | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 1st Qu.:0.0000 | |
| Median : 164080 | III - Central Luzon : 3237 | Median : 72986 | Wage/Salaries :20388 | Median :0.0000 | Median : 23324 | Median : 16620 | Median : 7332 | Median : 8695 | Median : 1820 | Median : 4314 | Median : 7314 | Median : 270 | Median : 300 | Median : 2740 | Median : 22992 | Median : 10800 | Median : 1125 | Median : 6036 | Median : 1506 | Median : 880 | Median : 6804 | Median : 1500 | Median : 0 | Median : 19222 | NA | Median :51.00 | Married :31347 | Grade 4 : 2282 | NA | General managers/managing proprietors in wholesale and retail trade : 2028 | Worked for government/government corporation : 2820 | Two or More Nonrelated Persons/Members: 167 | Median : 4.000 | Median :0.0000 | Median :1.000 | Median :1.000 | Institutional living quarter : 9 | Mixed but predominantly salvaged materials : 56 | Quite Strong : 3487 | Median : 40.0 | Median : 17.00 | Median :2.000 | Rent house/room including lot : 2203 | Water-sealed, other depository, used exclusively by household : 2343 | NA | Shared, faucet, community water system : 4614 | Median :1.0000 | Median :0.0000 | Median :0.0000 | Median :0.0000 | Median :0.0000 | Median :0.0000 | Median :0.00000 | Median :0.00000 | Median : 2.000 | Median :0.000 | Median :0.000 | Median :0.00000 | Median :0.0000 | |
| Mean : 247556 | VI - Western Visayas : 2851 | Mean : 85099 | NA | Mean :0.4299 | Mean : 25134 | Mean : 18196 | Mean : 10540 | Mean : 10529 | Mean : 2550 | Mean : 5007 | Mean : 15437 | Mean : 1085 | Mean : 2295 | Mean : 4955 | Mean : 38376 | Mean : 20922 | Mean : 7160 | Mean : 11806 | Mean : 4095 | Mean : 7474 | Mean : 12522 | Mean : 5266 | Mean : 13817 | Mean : 54376 | NA | Mean :51.38 | Single : 1942 | Grade 5 : 2123 | NA | General managers/managing proprietors in transportation, storage and communications: 1932 | Employer in own family-operated farm or business: 2581 | NA | Mean : 4.635 | Mean :0.4102 | Mean :1.363 | Mean :1.273 | Multi-unit residential : 1329 | Mixed but predominantly strong materials : 2002 | Salvaged : 456 | Mean : 55.6 | Mean : 20.13 | Mean :1.788 | Rent-free house and lot with consent of owner : 2014 | Closed pit : 2273 | NA | Own use, tubed/piped deep well : 4587 | Mean :0.8569 | Mean :0.4352 | Mean :0.1621 | Mean :0.3942 | Mean :0.3198 | Mean :0.1298 | Mean :0.08121 | Mean :0.06061 | Mean : 1.906 | Mean :0.315 | Mean :0.135 | Mean :0.01312 | Mean :0.2899 | |
| 3rd Qu.: 291138 | VII - Central Visayas: 2541 | 3rd Qu.:105636 | NA | 3rd Qu.:1.0000 | 3rd Qu.: 31439 | 3rd Qu.: 23920 | 3rd Qu.: 14292 | 3rd Qu.: 13388 | 3rd Qu.: 3100 | 3rd Qu.: 6304 | 3rd Qu.: 19921 | 3rd Qu.: 1299 | 3rd Qu.: 3146 | 3rd Qu.: 5580 | 3rd Qu.: 45948 | 3rd Qu.: 24000 | 3rd Qu.: 4680 | 3rd Qu.: 13776 | 3rd Qu.: 3900 | 3rd Qu.: 4060 | 3rd Qu.: 14154 | 3rd Qu.: 5000 | 3rd Qu.: 6313 | 3rd Qu.: 65969 | NA | 3rd Qu.:61.00 | Unknown : 1 | Second Year High School: 2104 | NA | Corn farmers : 1724 | Worked for private household : 811 | NA | 3rd Qu.: 6.000 | 3rd Qu.:1.0000 | 3rd Qu.:2.000 | 3rd Qu.:2.000 | Other building unit (e.g. cave, boat) : 2 | Not Applicable : 12 | Strong :27739 | 3rd Qu.: 70.0 | 3rd Qu.: 26.00 | 3rd Qu.:2.000 | Own house, rent-free lot without consent of owner: 995 | None : 1580 | NA | Dug well : 3876 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:0.00000 | 3rd Qu.:0.00000 | 3rd Qu.: 3.000 | 3rd Qu.:0.000 | 3rd Qu.:0.000 | 3rd Qu.:0.00000 | 3rd Qu.:0.0000 | |
| Max. :11815988 | V - Bicol Region : 2472 | Max. :827565 | NA | Max. :2.0000 | Max. :765864 | Max. :758326 | Max. :261566 | Max. :188208 | Max. :273769 | Max. :74800 | Max. :725296 | Max. :59592 | Max. :139370 | Max. :356750 | Max. :2188560 | Max. :1920000 | Max. :1049275 | Max. :834996 | Max. :149940 | Max. :731000 | Max. :553560 | Max. :556700 | Max. :3729973 | Max. :9234485 | NA | Max. :99.00 | Widowed : 6818 | Grade 3 : 1994 | NA | (Other) :21997 | (Other) : 299 | NA | Max. :26.000 | Max. :5.0000 | Max. :8.000 | Max. :8.000 | Single house :39069 | Salvaged/makeshift materials : 212 | Very Light : 1583 | Max. :998.0 | Max. :200.00 | Max. :9.000 | Own house, rent lot : 425 | Open pit : 1189 | NA | Protected spring, river, stream, etc : 2657 | Max. :6.0000 | Max. :5.0000 | Max. :5.0000 | Max. :5.0000 | Max. :3.0000 | Max. :5.0000 | Max. :5.00000 | Max. :4.00000 | Max. :10.000 | Max. :6.000 | Max. :3.000 | Max. :3.00000 | Max. :5.0000 | |
| NA | (Other) :22151 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | (Other) :15773 | NA | NA’s : 7536 | NA’s : 7536 | NA | NA | NA | NA | NA | NA | Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos):33342 | NA | NA | NA | NA | (Other) : 201 | (Other) : 1303 | NA | (Other) : 3475 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
# ----- División de variables en categóricas y numéricas -----
nums <- datos %>%
select_if(is.numeric)
cat <- datos %>%
select_if(is.factor)
A la luz de la escasa documentación referida al conjunto de datos, ha sido imposible descifrar el significado de algunas variables. Por ejemplo, la variable Agricultural.Household.indicator posee 3 valores 0,1 y 2. Tras no conocer su interpretación se decide eliminar de nuestro estudio.
# ----- Eliminación de variables del dataset -----
datos<-datos%>%select(-Agricultural.Household.indicator,-Members.with.age.less.than.5.year.old,-Members.with.age.5...17.years.old,-Household.Head.Occupation)
Una vez descartadas aquellas variables, se irán etiquetando como NA (missing values) todos aquellos valores considerados erróneos o no recogidos. Por ejemplo, en algunas variables categóricas existe la categoría unknown, not applicable, etc. En el caso de las cuantitativas hay que tener cierto cuidado ya que algunas tienen valor 0. Este valor no parace indicar que se trate de un NA ya que en casos de familias en situación de extrema pobreza, un gasto de 0 en restaurantes y hoteles tiene sentido. Se recuerda que en este data set se trabajan con datos socio económicos en los cuales el valor 0 tiene sentido.
También, se categorizarán ciertas variables, seleccionando las posibles categorías que podrán adquirir.
La función summary ofrece un resumen de cada variable del data set. De esta forma, se podrá inspeccionar fácilmente infomación como los estadísticos básicos en variables numéricas, o la proporción de cada tipo de categorías dentro de las variables categóricas.
Este primer paso no busca un análisis exahustivo de los datos, sino más bien un contacto inicial que permita detectar anomalías a primera vista. Se comentarán sólo los resultados destacables.
# -----Corrección de valores en variables y categorización -----
kable(summary(datos$Main.Source.of.Income),caption="Summary de Main.Source.of.Income") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Enterpreneurial Activities | 10320 |
| Other sources of Income | 10836 |
| Wage/Salaries | 20388 |
datos$Main.Source.of.Income = factor(datos$Main.Source.of.Income,ordered=TRUE,levels=(c('Other sources of Income'
, 'Enterpreneurial Activities'
, 'Wage/Salaries')))
#--------------------------------------------------
kable(summary(datos$Household.Head.Marital.Status),caption="Summary de Household.Head.Marital.Status") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Annulled | 11 |
| Divorced/Separated | 1425 |
| Married | 31347 |
| Single | 1942 |
| Unknown | 1 |
| Widowed | 6818 |
datos$Household.Head.Marital.Status[which(datos$Household.Head.Marital.Status=='Unknown')] <-NA # Se etiqueta como NA el valor "Unknown" (desconocido)
datos$Household.Head.Marital.Status<-fct_drop(datos$Household.Head.Marital.Status)
datos$Household.Head.Marital.Status =
factor(datos$Household.Head.Marital.Status,ordered=TRUE,levels=
(c('Single'
,'Widowed'
,'Annulled'
,'Divorced/Separated'
,'Married')))
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(summary(datos$Household.Head.Class.of.Worker),caption="Summary de Household.Head.Class.of.Worker") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Employer in own family-operated farm or business | 2581 |
| Self-employed wihout any employee | 13766 |
| Worked for government/government corporation | 2820 |
| Worked for private establishment | 13731 |
| Worked for private household | 811 |
| Worked with pay in own family-operated farm or business | 14 |
| Worked without pay in own family-operated farm or business | 285 |
| NA’s | 7536 |
datos$Household.Head.Class.of.Worker =
factor(datos$Household.Head.Class.of.Worker,ordered=TRUE,levels=
(c('Worked without pay in own family-operated farm or business'
,'Employer in own family-operated farm or business'
,'Worked with pay in own family-operated farm or business'
,'Self-employed wihout any employee'
,'Worked for private household'
,'Worked for private establishment'
,'Worked for government/government corporation')))
#---------------------------------------------------------------------------------------------------------------------------------------------
datos$Type.of.Household =
factor(datos$Type.of.Household,ordered=TRUE,levels=
(c('Single Family'
,'Two or More Nonrelated Persons/Members'
,'Extended Family')))
kable(levels(datos$Type.of.Household),caption="Summary de Type.of.Household") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x |
|---|
| Single Family |
| Two or More Nonrelated Persons/Members |
| Extended Family |
#---------------------------------------------------------------------------------------------------------------------------------------------
datos$Type.of.Building.House =
factor(datos$Type.of.Building.House,ordered=TRUE,levels=
(c('Other building unit (e.g. cave, boat)'
,'Institutional living quarter'
,'Commercial/industrial/agricultural building'
,'Single house'
,'Duplex'
,'Multi-unit residential')))
kable(levels(datos$Type.of.Building.House),caption="Summary de Type.of.Building.House") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x |
|---|
| Other building unit (e.g. cave, boat) |
| Institutional living quarter |
| Commercial/industrial/agricultural building |
| Single house |
| Duplex |
| Multi-unit residential |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(summary(datos$Type.of.Roof),caption="Summary de Type.of.Roof") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Light material (cogon,nipa,anahaw) | 5074 |
| Mixed but predominantly light materials | 846 |
| Mixed but predominantly salvaged materials | 56 |
| Mixed but predominantly strong materials | 2002 |
| Not Applicable | 12 |
| Salvaged/makeshift materials | 212 |
| Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos) | 33342 |
datos$Type.of.Roof[which(datos$Type.of.Roof=='Not Applicable')] <-NA # Se etiqueta como NA el valor "Not Applicable" (no aplicable)
datos$Type.of.Roof<-fct_drop(datos$Type.of.Roof)
datos$Type.of.Roof =
factor(datos$Type.of.Roof,ordered=TRUE,levels=
(c('Salvaged/makeshift materials'
,'Light material (cogon,nipa,anahaw)'
,'Mixed but predominantly salvaged materials'
,'Mixed but predominantly light materials'
,'Mixed but predominantly strong materials'
,'Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)')))
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(summary(datos$Type.of.Walls),caption="Summary de Type.of.Walls") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Light | 8267 |
| NOt applicable | 12 |
| Quite Strong | 3487 |
| Salvaged | 456 |
| Strong | 27739 |
| Very Light | 1583 |
datos$Type.of.Walls[which(datos$Type.of.Walls=='Not applicable')] <-NA # Se etiqueta como NA el valor "Not Applicable" (no aplicable)
datos$Type.of.Walls<-fct_drop(datos$Type.of.Walls)
datos$Type.of.Walls=
factor(datos$Type.of.Walls,ordered=TRUE,levels=
(c('Salvaged'
,'Very Light'
,'Light'
,'Strong'
,'Quite Strong')))
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(summary(datos$Toilet.Facilities),caption="Summary de Toilet.Facilities") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Closed pit | 2273 |
| None | 1580 |
| Open pit | 1189 |
| Others | 353 |
| Water-sealed, other depository, shared with other household | 950 |
| Water-sealed, other depository, used exclusively by household | 2343 |
| Water-sealed, sewer septic tank, shared with other household | 3694 |
| Water-sealed, sewer septic tank, used exclusively by household | 29162 |
datos$Toilet.Facilities=
factor(datos$Toilet.Facilities,ordered=TRUE,levels=
(c('None'
,'Others'
,'Open pit'
,'Closed pit'
,'Water-sealed, other depository, shared with other household'
,'Water-sealed, other depository, used exclusively by household'
,'Water-sealed, sewer septic tank, shared with other household'
,'Water-sealed, sewer septic tank, used exclusively by household')))
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(summary(datos$Main.Source.of.Water.Supply),caption="Summary de Main.Source.of.Water.Supply") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Dug well | 3876 |
| Lake, river, rain and others | 496 |
| Others | 127 |
| Own use, faucet, community water system | 16093 |
| Own use, tubed/piped deep well | 4587 |
| Peddler | 851 |
| Protected spring, river, stream, etc | 2657 |
| Shared, faucet, community water system | 4614 |
| Shared, tubed/piped deep well | 6242 |
| Tubed/piped shallow well | 1394 |
| Unprotected spring, river, stream, etc | 607 |
datos$Main.Source.of.Water.Supply=
factor(datos$Main.Source.of.Water.Supply,ordered=TRUE,levels=
(c('Others'
,'Dug well'
,'Lake, river, rain and others'
,'Unprotected spring, river, stream, etc'
,'Protected spring, river, stream, etc'
,'Tubed/piped shallow well'
,'Shared, tubed/piped deep well'
,'Own use, tubed/piped deep well'
,'Peddler'
,'Shared, faucet, community water system'
,'Own use, faucet, community water system')))
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(summary(datos$Tenure.Status),caption="Summary de Tenure.Status") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Not Applicable | 73 |
| Own house, rent lot | 425 |
| Own house, rent-free lot with consent of owner | 6165 |
| Own house, rent-free lot without consent of owner | 995 |
| Own or owner-like possession of house and lot | 29541 |
| Rent house/room including lot | 2203 |
| Rent-free house and lot with consent of owner | 2014 |
| Rent-free house and lot without consent of owner | 128 |
datos$Tenure.Status[which(datos$Tenure.Status=='Not Applicable')] <-NA # Se etiqueta como NA el valor "Not Applicable" (no aplicable)
datos$Tenure.Status<-fct_drop(datos$Tenure.Status)
La variable “Electricity” es un claro ejemplo de la importancia de no tratar como NA todos aquellos valores iguales a 0. Puesto que no existe ninguna descripción de las variables del dataset, más allá del propio nombre, se trata de ver a qué se refieren esos 0. A continuación, se muestra una representación gráfica de la variable:
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(summary(datos$Electricity),caption="Summary de Electricity") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| 0 | 4536 |
| 1 | 37008 |
ggplot(datos, aes(x=Number.of.Airconditioner,fill= Electricity)) + geom_bar(position = "dodge")
A la vista de la gráfica, se concluye que todos los usuarios que tienen aire acondicionado, tienen un 1 en Electricity, y que ningún usuario con un 0 tiene aire acondicionado, por lo que es posible afirmar que el 1 corresponde a tener electricidad, y el 0 a no tenerla.
Para clarificar, será categorizada con valores de “Si” y “No”, que sustituirán a los unos y ceros, respectivamente.
# Sustitución de 0/1 por No/Si
datos$Electricity<-fct_recode(datos$Electricity,Si='1',No='0')
El último grupo de variables, corresponde al número de bienes adquiridos. Dichas variables son marcadas como numéricas, pero sus rangos son muy reducidos con respecto a las demás. Se inspecciona más a fondo estas variables viendo sus principales estadísticos:
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.bedrooms)),caption="Summary de Number.of.bedrooms") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.000000 |
| 1st Qu. | 1.000000 |
| Median | 2.000000 |
| Mean | 1.788008 |
| 3rd Qu. | 2.000000 |
| Max. | 9.000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Refrigerator.Freezer)),caption="Summary de Number.of.Refrigerator.Freezer") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.3941845 |
| 3rd Qu. | 1.0000000 |
| Max. | 5.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Washing.Machine)),caption="Summary de Number.of.Washing.Machine") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.3198055 |
| 3rd Qu. | 1.0000000 |
| Max. | 3.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Airconditioner)),caption="Summary de Number.of.Airconditioner") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.1298142 |
| 3rd Qu. | 0.0000000 |
| Max. | 5.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Car..Jeep..Van)),caption="Summary de Number.of.Car..Jeep..Van") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.0812151 |
| 3rd Qu. | 0.0000000 |
| Max. | 5.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.CD.VCD.DVD)),caption="Summary de Number.of.CD.VCD.DVD") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.4352253 |
| 3rd Qu. | 1.0000000 |
| Max. | 5.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Cellular.phone)),caption="Summary de Number.of.Cellular.phone") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.000000 |
| 1st Qu. | 1.000000 |
| Median | 2.000000 |
| Mean | 1.905739 |
| 3rd Qu. | 3.000000 |
| Max. | 10.000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Component.Stereo.set)),caption="Summary de Number.of.Component.Stereo.set") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.1621413 |
| 3rd Qu. | 0.0000000 |
| Max. | 5.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Landline.wireless.telephones)),caption="Summary de Number.of.Landline.wireless.telephones") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.0606104 |
| 3rd Qu. | 0.0000000 |
| Max. | 4.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Personal.Computer)),caption="Summary de Number.of.Personal.Computer") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.3150154 |
| 3rd Qu. | 0.0000000 |
| Max. | 6.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Motorcycle.Tricycle)),caption="Summary de Number.of.Motorcycle.Tricycle") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.2898854 |
| 3rd Qu. | 0.0000000 |
| Max. | 5.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Stove.with.Oven.Gas.Range)),caption="Summary de Number.of.Stove.with.Oven.Gas.Range") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.1350376 |
| 3rd Qu. | 0.0000000 |
| Max. | 3.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Television)),caption="Summary de Number.of.Television") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 1.0000000 |
| Mean | 0.8568746 |
| 3rd Qu. | 1.0000000 |
| Max. | 6.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
kable(c(summary(datos$Number.of.Motorized.Banca)),caption="Summary de Number.of.Motorized.Banca") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "100%")
| x | |
|---|---|
| Min. | 0.0000000 |
| 1st Qu. | 0.0000000 |
| Median | 0.0000000 |
| Mean | 0.0131186 |
| 3rd Qu. | 0.0000000 |
| Max. | 3.0000000 |
#---------------------------------------------------------------------------------------------------------------------------------------------
Una vez están los datos ordenados, y debido a que el volumen de la muestra inicial podría ser un problema al tratar con ellos, se procede a realizar un muestreo de 10000 observaciones con muestreo aleatorio simple, fijando una semilla aleatoria.
Se divide la muestra de 10000 observaciones en dos conjuntos: uno de train y otro de test/validación (70%-30%). Se trabajará con el conjunto de train, mientras que el de test será reservado para la parte final (evaluación del modelo).
# ----- Creación de una muestra del conjunto inicial de datos con muestreo aleatorio simple sin reemplazamiento -----
set.seed(300)
datos_s <- datos %>%
sample_n(size=10000,replace=FALSE)
# División de la muestra de 10000 observaciones en dos conjuntos: uno de train y otro de test (70%-30%)
training <- createDataPartition(pull(datos_s, Total.Household.Income ),
p = 0.7, list = FALSE, times = 1)
datos_training <- slice(datos_s, training)
datos_testing <- slice(datos_s, -training)
var_train_cat <- datos_training%>%select_if(is.factor)
var_train_num <- datos_training%>%select_if(is.numeric)
Para estudiar más a fondo las variables cualitativas, es conveniente ver sus frecuencias absolutas, una a una, con ayuda de la función table(). Sacar la frecuencia absoluta de cada variable cualitativa es importante para determinar cuán homogénea es la población (por ejemplo, tipo de trabajo, sexo o tipo de casas más comunes). Además, al disponer de variables muy específicas sobre la estructura de la casa y bienes básicos del hogar, podrá verse fácilmente si la mayoría de la población dispone de una calidad de vida digna, o por el contrario, tiende a estar en niveles de pobreza alarmantes (por ejemplo, tener electricidad o no tener).
Region:
# ----- Frecuencias absolutas y relativas ------
# Frecuencias absolutas - función table() (tabla de contingencia)
table(var_train_cat$Region)
##
## ARMM CAR Caraga
## 375 299 299
## I - Ilocos Region II - Cagayan Valley III - Central Luzon
## 401 367 532
## IVA - CALABARZON IVB - MIMAROPA IX - Zasmboanga Peninsula
## 658 193 319
## NCR V - Bicol Region VI - Western Visayas
## 724 412 497
## VII - Central Visayas VIII - Eastern Visayas X - Northern Mindanao
## 416 397 296
## XI - Davao Region XII - SOCCSKSARGEN
## 432 383
Main.Source.of.Income
table(var_train_cat$Main.Source.of.Income)
##
## Other sources of Income Enterpreneurial Activities
## 1790 1771
## Wage/Salaries
## 3439
Household.Head.Job.or.Business.Indicator
table(var_train_cat$Household.Head.Job.or.Business.Indicator)
##
## No Job/Business With Job/Business
## 1241 5759
Household.Head.Sex
table(var_train_cat$Household.Head.Sex)
##
## Female Male
## 1513 5487
Household.Head.Marital.Status
table(var_train_cat$Household.Head.Marital.Status)
##
## Single Widowed Annulled Divorced/Separated
## 319 1143 4 223
## Married
## 5311
Household.Head.Job.or.Business.Indicator
table(var_train_cat$Household.Head.Job.or.Business.Indicator)
##
## No Job/Business With Job/Business
## 1241 5759
Household.Head.Class.of.Worker
table(var_train_cat$Household.Head.Class.of.Worker)
##
## Worked without pay in own family-operated farm or business
## 50
## Employer in own family-operated farm or business
## 434
## Worked with pay in own family-operated farm or business
## 4
## Self-employed wihout any employee
## 2304
## Worked for private household
## 139
## Worked for private establishment
## 2339
## Worked for government/government corporation
## 489
Type.of.Household
table(var_train_cat$Type.of.Household)
##
## Single Family Two or More Nonrelated Persons/Members
## 4790 24
## Extended Family
## 2186
Type.of.Building.House
table(var_train_cat$Type.of.Building.House)
##
## Other building unit (e.g. cave, boat)
## 0
## Institutional living quarter
## 2
## Commercial/industrial/agricultural building
## 7
## Single house
## 6584
## Duplex
## 173
## Multi-unit residential
## 234
Type.of.Roof
table(var_train_cat$Type.of.Roof)
##
## Salvaged/makeshift materials
## 32
## Light material (cogon,nipa,anahaw)
## 845
## Mixed but predominantly salvaged materials
## 11
## Mixed but predominantly light materials
## 133
## Mixed but predominantly strong materials
## 345
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## 5633
Type.of.Walls
table(var_train_cat$Type.of.Walls)
##
## Salvaged Very Light Light Strong Quite Strong
## 81 258 1366 4701 590
Tenure.Status
table(var_train_cat$Tenure.Status)
##
## Own house, rent lot
## 81
## Own house, rent-free lot with consent of owner
## 1005
## Own house, rent-free lot without consent of owner
## 151
## Own or owner-like possession of house and lot
## 5011
## Rent house/room including lot
## 389
## Rent-free house and lot with consent of owner
## 336
## Rent-free house and lot without consent of owner
## 20
Toilet.Facilities
table(var_train_cat$Toilet.Facilities)
##
## None
## 255
## Others
## 65
## Open pit
## 206
## Closed pit
## 361
## Water-sealed, other depository, shared with other household
## 133
## Water-sealed, other depository, used exclusively by household
## 428
## Water-sealed, sewer septic tank, shared with other household
## 646
## Water-sealed, sewer septic tank, used exclusively by household
## 4906
Electricity
table(var_train_cat$Electricity)
##
## No Si
## 753 6247
Main.Source.of.Water.Supply
table(var_train_cat$Main.Source.of.Water.Supply)
##
## Others Dug well
## 16 663
## Lake, river, rain and others Unprotected spring, river, stream, etc
## 81 105
## Protected spring, river, stream, etc Tubed/piped shallow well
## 459 251
## Shared, tubed/piped deep well Own use, tubed/piped deep well
## 1012 775
## Peddler Shared, faucet, community water system
## 147 769
## Own use, faucet, community water system
## 2722
Number.of.Motorcycle.Tricycle
table(var_train_cat$Number.of.Motorcycle.Tricycle)
## < table of extent 0 >
Household.Head.Highest.Grade.Completed
table(var_train_cat$Household.Head.Highest.Grade.Completed)
##
## Agriculture, Forestry, and Fishery Programs
## 40
## Architecture and Building Programs
## 6
## Arts Programs
## 4
## Basic Programs
## 6
## Business and Administration Programs
## 212
## Computing/Information Technology Programs
## 51
## Elementary Graduate
## 1261
## Engineering and Engineering trades Programs
## 81
## Engineering and Engineering Trades Programs
## 147
## Environmental Protection Programs
## 2
## First Year College
## 145
## First Year High School
## 209
## First Year Post Secondary
## 20
## Fourth Year College
## 17
## Grade 1
## 152
## Grade 2
## 263
## Grade 3
## 309
## Grade 4
## 366
## Grade 5
## 348
## Grade 6
## 49
## Health Programs
## 66
## High School Graduate
## 1665
## Humanities Programs
## 9
## Journalism and Information Programs
## 7
## Law Programs
## 5
## Life Sciences Programs
## 4
## Manufacturing and Processing Programs
## 3
## Mathematics and Statistics Programs
## 2
## No Grade Completed
## 195
## Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree
## 13
## Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level)
## 0
## Personal Services Programs
## 22
## Physical Sciences Programs
## 3
## Post Baccalaureate
## 39
## Preschool
## 3
## Second Year College
## 216
## Second Year High School
## 363
## Second Year Post Secondary
## 22
## Security Services Programs
## 52
## Social and Behavioral Science Programs
## 23
## Social Services Programs
## 0
## Teacher Training and Education Sciences Programs
## 159
## Third Year College
## 167
## Third Year High School
## 234
## Transport Services Programs
## 39
## Veterinary Programs
## 1
Seguidamente, serán visualizadas las frecuencias relativas, esta vez utilizando la función prop.table()
Region:
# ----- Frecuencias absolutas y relativas ------
# Frecuencias relativas - función prop.table()
prop.table(table(var_train_cat$Region))
##
## ARMM CAR Caraga
## 0.05357143 0.04271429 0.04271429
## I - Ilocos Region II - Cagayan Valley III - Central Luzon
## 0.05728571 0.05242857 0.07600000
## IVA - CALABARZON IVB - MIMAROPA IX - Zasmboanga Peninsula
## 0.09400000 0.02757143 0.04557143
## NCR V - Bicol Region VI - Western Visayas
## 0.10342857 0.05885714 0.07100000
## VII - Central Visayas VIII - Eastern Visayas X - Northern Mindanao
## 0.05942857 0.05671429 0.04228571
## XI - Davao Region XII - SOCCSKSARGEN
## 0.06171429 0.05471429
Main.Source.of.Income:
prop.table(table(var_train_cat$Main.Source.of.Income))
##
## Other sources of Income Enterpreneurial Activities
## 0.2557143 0.2530000
## Wage/Salaries
## 0.4912857
Household.Head.Job.or.Business.Indicator
prop.table(table(var_train_cat$Household.Head.Job.or.Business.Indicator))
##
## No Job/Business With Job/Business
## 0.1772857 0.8227143
Household.Head.Sex
prop.table(table(var_train_cat$Household.Head.Sex))
##
## Female Male
## 0.2161429 0.7838571
Household.Head.Marital.Status
prop.table(table(var_train_cat$Household.Head.Marital.Status))
##
## Single Widowed Annulled Divorced/Separated
## 0.0455714286 0.1632857143 0.0005714286 0.0318571429
## Married
## 0.7587142857
Household.Head.Job.or.Business.Indicator
prop.table(table(var_train_cat$Household.Head.Job.or.Business.Indicator))
##
## No Job/Business With Job/Business
## 0.1772857 0.8227143
Household.Head.Class.of.Worker
prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
##
## Worked without pay in own family-operated farm or business
## 0.008682063
## Employer in own family-operated farm or business
## 0.075360306
## Worked with pay in own family-operated farm or business
## 0.000694565
## Self-employed wihout any employee
## 0.400069457
## Worked for private household
## 0.024136135
## Worked for private establishment
## 0.406146901
## Worked for government/government corporation
## 0.084910575
Type.of.Household
prop.table(table(var_train_cat$Type.of.Household))
##
## Single Family Two or More Nonrelated Persons/Members
## 0.684285714 0.003428571
## Extended Family
## 0.312285714
Type.of.Building.House
prop.table(table(var_train_cat$Type.of.Building.House))
##
## Other building unit (e.g. cave, boat)
## 0.0000000000
## Institutional living quarter
## 0.0002857143
## Commercial/industrial/agricultural building
## 0.0010000000
## Single house
## 0.9405714286
## Duplex
## 0.0247142857
## Multi-unit residential
## 0.0334285714
Type.of.Roof
prop.table(table(var_train_cat$Type.of.Roof))
##
## Salvaged/makeshift materials
## 0.004572082
## Light material (cogon,nipa,anahaw)
## 0.120731533
## Mixed but predominantly salvaged materials
## 0.001571653
## Mixed but predominantly light materials
## 0.019002715
## Mixed but predominantly strong materials
## 0.049292756
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## 0.804829261
Type.of.Walls
prop.table(table(var_train_cat$Type.of.Walls))
##
## Salvaged Very Light Light Strong Quite Strong
## 0.01157804 0.03687822 0.19525443 0.67195540 0.08433391
Tenure.Status
prop.table(table(var_train_cat$Tenure.Status))
##
## Own house, rent lot
## 0.011583012
## Own house, rent-free lot with consent of owner
## 0.143715144
## Own house, rent-free lot without consent of owner
## 0.021593022
## Own or owner-like possession of house and lot
## 0.716573717
## Rent house/room including lot
## 0.055627056
## Rent-free house and lot with consent of owner
## 0.048048048
## Rent-free house and lot without consent of owner
## 0.002860003
Toilet.Facilities
prop.table(table(var_train_cat$Toilet.Facilities))
##
## None
## 0.036428571
## Others
## 0.009285714
## Open pit
## 0.029428571
## Closed pit
## 0.051571429
## Water-sealed, other depository, shared with other household
## 0.019000000
## Water-sealed, other depository, used exclusively by household
## 0.061142857
## Water-sealed, sewer septic tank, shared with other household
## 0.092285714
## Water-sealed, sewer septic tank, used exclusively by household
## 0.700857143
Electricity
prop.table(table(var_train_cat$Electricity))
##
## No Si
## 0.1075714 0.8924286
Main.Source.of.Water.Supply
prop.table(table(var_train_cat$Main.Source.of.Water.Supply))
##
## Others Dug well
## 0.002285714 0.094714286
## Lake, river, rain and others Unprotected spring, river, stream, etc
## 0.011571429 0.015000000
## Protected spring, river, stream, etc Tubed/piped shallow well
## 0.065571429 0.035857143
## Shared, tubed/piped deep well Own use, tubed/piped deep well
## 0.144571429 0.110714286
## Peddler Shared, faucet, community water system
## 0.021000000 0.109857143
## Own use, faucet, community water system
## 0.388857143
Number.of.Motorcycle.Tricycle
prop.table(table(var_train_cat$Number.of.Motorcycle.Tricycle))
## numeric(0)
Household.Head.Highest.Grade.Completed
prop.table(table(var_train_cat$Household.Head.Highest.Grade.Completed))
##
## Agriculture, Forestry, and Fishery Programs
## 0.0057142857
## Architecture and Building Programs
## 0.0008571429
## Arts Programs
## 0.0005714286
## Basic Programs
## 0.0008571429
## Business and Administration Programs
## 0.0302857143
## Computing/Information Technology Programs
## 0.0072857143
## Elementary Graduate
## 0.1801428571
## Engineering and Engineering trades Programs
## 0.0115714286
## Engineering and Engineering Trades Programs
## 0.0210000000
## Environmental Protection Programs
## 0.0002857143
## First Year College
## 0.0207142857
## First Year High School
## 0.0298571429
## First Year Post Secondary
## 0.0028571429
## Fourth Year College
## 0.0024285714
## Grade 1
## 0.0217142857
## Grade 2
## 0.0375714286
## Grade 3
## 0.0441428571
## Grade 4
## 0.0522857143
## Grade 5
## 0.0497142857
## Grade 6
## 0.0070000000
## Health Programs
## 0.0094285714
## High School Graduate
## 0.2378571429
## Humanities Programs
## 0.0012857143
## Journalism and Information Programs
## 0.0010000000
## Law Programs
## 0.0007142857
## Life Sciences Programs
## 0.0005714286
## Manufacturing and Processing Programs
## 0.0004285714
## Mathematics and Statistics Programs
## 0.0002857143
## No Grade Completed
## 0.0278571429
## Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree
## 0.0018571429
## Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level)
## 0.0000000000
## Personal Services Programs
## 0.0031428571
## Physical Sciences Programs
## 0.0004285714
## Post Baccalaureate
## 0.0055714286
## Preschool
## 0.0004285714
## Second Year College
## 0.0308571429
## Second Year High School
## 0.0518571429
## Second Year Post Secondary
## 0.0031428571
## Security Services Programs
## 0.0074285714
## Social and Behavioral Science Programs
## 0.0032857143
## Social Services Programs
## 0.0000000000
## Teacher Training and Education Sciences Programs
## 0.0227142857
## Third Year College
## 0.0238571429
## Third Year High School
## 0.0334285714
## Transport Services Programs
## 0.0055714286
## Veterinary Programs
## 0.0001428571
Al examinar las visualizaciones, la variable categórica electricity llama la atención. Se procede a comparar la variable electricity por regiones, ya que puede dar una idea acerca de en qué regiones puede existir mayor nivel de pobreza. Esto se realiza mediante la función cross-table, que nos mostrará las frecuencias absolutas, relativas en relación a la fila, frecuencias relativas en relación a la columna y frecuencias relativas globales:
CrossTable(var_train_cat$Region, var_train_cat$Electricity, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 7000
##
##
## | var_train_cat$Electricity
## var_train_cat$Region | No | Si | Row Total |
## --------------------------|-----------|-----------|-----------|
## ARMM | 162 | 213 | 375 |
## | 0.432 | 0.568 | 0.054 |
## | 0.215 | 0.034 | |
## | 0.023 | 0.030 | |
## --------------------------|-----------|-----------|-----------|
## CAR | 18 | 281 | 299 |
## | 0.060 | 0.940 | 0.043 |
## | 0.024 | 0.045 | |
## | 0.003 | 0.040 | |
## --------------------------|-----------|-----------|-----------|
## Caraga | 24 | 275 | 299 |
## | 0.080 | 0.920 | 0.043 |
## | 0.032 | 0.044 | |
## | 0.003 | 0.039 | |
## --------------------------|-----------|-----------|-----------|
## I - Ilocos Region | 20 | 381 | 401 |
## | 0.050 | 0.950 | 0.057 |
## | 0.027 | 0.061 | |
## | 0.003 | 0.054 | |
## --------------------------|-----------|-----------|-----------|
## II - Cagayan Valley | 23 | 344 | 367 |
## | 0.063 | 0.937 | 0.052 |
## | 0.031 | 0.055 | |
## | 0.003 | 0.049 | |
## --------------------------|-----------|-----------|-----------|
## III - Central Luzon | 13 | 519 | 532 |
## | 0.024 | 0.976 | 0.076 |
## | 0.017 | 0.083 | |
## | 0.002 | 0.074 | |
## --------------------------|-----------|-----------|-----------|
## IVA - CALABARZON | 26 | 632 | 658 |
## | 0.040 | 0.960 | 0.094 |
## | 0.035 | 0.101 | |
## | 0.004 | 0.090 | |
## --------------------------|-----------|-----------|-----------|
## IVB - MIMAROPA | 26 | 167 | 193 |
## | 0.135 | 0.865 | 0.028 |
## | 0.035 | 0.027 | |
## | 0.004 | 0.024 | |
## --------------------------|-----------|-----------|-----------|
## IX - Zasmboanga Peninsula | 53 | 266 | 319 |
## | 0.166 | 0.834 | 0.046 |
## | 0.070 | 0.043 | |
## | 0.008 | 0.038 | |
## --------------------------|-----------|-----------|-----------|
## NCR | 6 | 718 | 724 |
## | 0.008 | 0.992 | 0.103 |
## | 0.008 | 0.115 | |
## | 0.001 | 0.103 | |
## --------------------------|-----------|-----------|-----------|
## V - Bicol Region | 47 | 365 | 412 |
## | 0.114 | 0.886 | 0.059 |
## | 0.062 | 0.058 | |
## | 0.007 | 0.052 | |
## --------------------------|-----------|-----------|-----------|
## VI - Western Visayas | 67 | 430 | 497 |
## | 0.135 | 0.865 | 0.071 |
## | 0.089 | 0.069 | |
## | 0.010 | 0.061 | |
## --------------------------|-----------|-----------|-----------|
## VII - Central Visayas | 47 | 369 | 416 |
## | 0.113 | 0.887 | 0.059 |
## | 0.062 | 0.059 | |
## | 0.007 | 0.053 | |
## --------------------------|-----------|-----------|-----------|
## VIII - Eastern Visayas | 64 | 333 | 397 |
## | 0.161 | 0.839 | 0.057 |
## | 0.085 | 0.053 | |
## | 0.009 | 0.048 | |
## --------------------------|-----------|-----------|-----------|
## X - Northern Mindanao | 44 | 252 | 296 |
## | 0.149 | 0.851 | 0.042 |
## | 0.058 | 0.040 | |
## | 0.006 | 0.036 | |
## --------------------------|-----------|-----------|-----------|
## XI - Davao Region | 48 | 384 | 432 |
## | 0.111 | 0.889 | 0.062 |
## | 0.064 | 0.061 | |
## | 0.007 | 0.055 | |
## --------------------------|-----------|-----------|-----------|
## XII - SOCCSKSARGEN | 65 | 318 | 383 |
## | 0.170 | 0.830 | 0.055 |
## | 0.086 | 0.051 | |
## | 0.009 | 0.045 | |
## --------------------------|-----------|-----------|-----------|
## Column Total | 753 | 6247 | 7000 |
## | 0.108 | 0.892 | |
## --------------------------|-----------|-----------|-----------|
##
##
Se incorpora al análisis una tercera variable que suscita interés en el estudio: la variable Sex, que indica el sexo de la persona que toma las decisiones en el hogar.
# ----- Estudio de frecuencias multidimensionales -----
# Análisis de la variable electricity/región/sexo
ftable(var_train_cat$Region, var_train_cat$Household.Head.Sex, var_train_cat$Electricity)
## No Si
##
## ARMM Female 9 18
## Male 153 195
## CAR Female 1 65
## Male 17 216
## Caraga Female 4 46
## Male 20 229
## I - Ilocos Region Female 7 97
## Male 13 284
## II - Cagayan Valley Female 1 50
## Male 22 294
## III - Central Luzon Female 4 124
## Male 9 395
## IVA - CALABARZON Female 9 156
## Male 17 476
## IVB - MIMAROPA Female 1 36
## Male 25 131
## IX - Zasmboanga Peninsula Female 7 51
## Male 46 215
## NCR Female 1 194
## Male 5 524
## V - Bicol Region Female 8 86
## Male 39 279
## VI - Western Visayas Female 14 106
## Male 53 324
## VII - Central Visayas Female 10 106
## Male 37 263
## VIII - Eastern Visayas Female 9 77
## Male 55 256
## X - Northern Mindanao Female 8 53
## Male 36 199
## XI - Davao Region Female 11 73
## Male 37 311
## XII - SOCCSKSARGEN Female 11 60
## Male 54 258
Tras haber realizado un análisis inicial de cada variable del data set, es hora de visualizar algunos de los datos que más llamativos. De las variables preseleccionadas, se destacan los ingresos, la región y el tipo de trabajo como puntos clave para modelar mejor la distribución de esta población.
Más adelante, cuando sean visualizados algunos de los datos de las variables combinados, y se planteen algunas preguntas de interés, podrán sacarse algunas conclusiones sobre las características más relevantes de la población filipina (véase apartado 4.4 Representación datos cualitativos y cuantitativos).
# ----- Gráficos EDA con variables cualitativas individuales -----
ggplot(datos, aes(Region)) + geom_bar() + ggtitle("Núm. familias. por Región") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(datos, aes(Main.Source.of.Income)) + geom_bar() + ggtitle("Núm. familias. por fuente de ingresos")
# ----- Visualización de datos cualitativos -----
barplot(table(datos$Region), col = c("lightblue","yellow", "cadetblue4"),
main = "Diagrama de barras de las frecuencias absolutas\n de la variable \"Region\"")
barplot(prop.table(table(datos$Household.Head.Class.of.Worker,datos$Main.Source.of.Income)),
beside = TRUE, col = c("chocolate","cornsilk1","cornflowerblue","blueviolet", "darkgoldenrod1", "coral", "brown", "chartreuse4"),
legend.text = T, main = "Frecuencias relativas de fuente de\n ingresos por tipo de trabajo",
ylim = c(0,1))
Se dispone a ver la distribución y densidad de cada una de las variables cuantitativas sin transformar, es decir, las variables “en crudo”. De esta manera, se pretende identificar aquellas con los datos más sesgados, y poder observar las distribuciones y rangos que presentan. Se irán visualizando las distribuciones de las variables, agrupadas de 6 en 6 para facilitar la visualización.
# ----- Histograma de las variables cuantitativas sin transformar -----
# Primeras 6 variables
p1 <- qplot(var_train_num$Total.Household.Income,
geom="histogram",
binwidth = 10000,
main = "Histogram for Total Household Income",
xlab = "Total Household Income",
fill=I("blue"),
col=I("red"),
xlim=c(10000,12000000))
p2 <- qplot(var_train_num$Total.Food.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Total Food Expenditure",
xlab = "Total Food Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(2000,800000))
p3 <- qplot(var_train_num$Bread.and.Cereals.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Bread.and.Cereals.Expenditure",
xlab = "Bread.and.Cereals.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,350000))
p4 <- qplot(var_train_num$Total.Rice.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Total Rice Expenditure",
xlab = "Total Rice Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,350000))
p5 <- qplot(var_train_num$Meat.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Meat.Expenditure",
xlab = "Meat.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,270000))
p6 <- qplot(var_train_num$Total.Fish.and..marine.products.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Total.Fish.and..marine.products.Expenditure",
xlab = "Total.Fish.and..marine.products.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,190000))
grid.arrange(p1,p2,p3,p4,p5,p6,nrow=2)
# Siguientes 6 variables
d1 <- qplot(var_train_num$Fruit.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Fruit.Expenditure",
xlab = "Fruit.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,70000))
d2 <- qplot(var_train_num$Vegetables.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Vegetables.Expenditure",
xlab = "Vegetables.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,80000))
d3 <- qplot(var_train_num$Restaurant.and.hotels.Expenditure,
geom="histogram",
binwidth = 5000,
main = "Histogram for Restaurant.and.hotels.Expenditure",
xlab = "Restaurant.and.hotels.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-5000,520000))
d4 <- qplot(var_train_num$Alcoholic.Beverages.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Alcoholic.Beverages.Expenditure",
xlab = "Alcoholic.Beverages.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,36000))
d5 <- qplot(var_train_num$Tobacco.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Tobacco.Expenditure",
xlab = "Tobacco.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,100000))
d6 <- qplot(var_train_num$Clothing..Footwear.and.Other.Wear.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure",
xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,360000))
grid.arrange(d1,d2,d3,d4,d5,d6,nrow=2)
# Siguientes 6 variables
h1 <- qplot(var_train_num$Housing.and.water.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Housing.and.water.Expenditure",
xlab = "Housing.and.water.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(2000,842000))
h2 <- qplot(var_train_num$Imputed.House.Rental.Value,
geom="histogram",
binwidth = 1000,
main = "Histogram for Imputed.House.Rental.Value",
xlab = "Imputed.House.Rental.Value",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,730000))
h3 <- qplot(var_train_num$Medical.Care.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Medical.Care.Expenditure",
xlab = "Medical.Care.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,1000000))
h4 <- qplot(var_train_num$Transportation.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Transportation.Expenditure",
xlab = "Transportation.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,500000))
h5 <- qplot(var_train_num$Communication.Expenditure,
geom="histogram",
binwidth = 1000,
main = "Histogram for Communication.Expenditure",
xlab = "Communication.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-1000,100000))
h6 <- qplot(var_train_num$Education.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Education.Expenditure",
xlab = "Education.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,340000))
grid.arrange(h1,h2,h3,h4,h5,h6,nrow=2)
# Siguientes 6 variables
s1 <- qplot(var_train_num$Miscellaneous.Goods.and.Services.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure",
xlab = "Miscellaneous.Goods.and.Services.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,320000))
s2 <- qplot(var_train_num$Special.Occasions.Expenditure,
geom="histogram",
binwidth = 10000,
main = "Histogram for Special.Occasions.Expenditure",
xlab = "Special.Occasions.Expenditure",
fill=I("blue"),
col=I("red"),
xlim=c(-10000,310000))
s3 <- qplot(var_train_num$Crop.Farming.and.Gardening.expenses,
geom="histogram",
binwidth = 100000,
main = "Histogram for Crop.Farming.and.Gardening.expenses",
xlab = "Crop.Farming.and.Gardening.expenses",
fill=I("blue"),
col=I("red"),
xlim=c(-100000,3800000))
s4 <- qplot(var_train_num$Total.Income.from.Entrepreneurial.Acitivites,
geom="histogram",
binwidth = 100000,
main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites",
xlab = "Total.Income.from.Entrepreneurial.Acitivites",
fill=I("blue"),
col=I("red"),
xlim=c(-100000,4800000))
s5 <- qplot(var_train_num$Household.Head.Age,
geom="histogram",
binwidth = 5,
main = "Histogram for Household.Head.Age",
xlab = "Household.Head.Age",
fill=I("blue"),
col=I("red"),
xlim=c(10,100))
s6 <- qplot(var_train_num$Total.Number.of.Family.members,
geom="histogram",
binwidth = 1,
main = "Histogram for Total.Number.of.Family.members",
xlab = "Total.Number.of.Family.members",
fill=I("blue"),
col=I("red"),
xlim=c(-1,23))
grid.arrange(s1,s2,s3,s4,s5,s6,nrow=2)
# Otras 6 variables
g1 <- qplot(var_train_num$Total.number.of.family.members.employed,
geom="histogram",
binwidth = 1,
main = "Histogram for Total.number.of.family.members.employed",
xlab = "Total.number.of.family.members.employed",
fill=I("blue"),
col=I("red"),
xlim=c(-1,10))
g2 <- qplot(var_train_num$House.Floor.Area,
geom="histogram",
binwidth = 25,
main = "Histogram for House.Floor.Area",
xlab = "House.Floor.Area",
fill=I("blue"),
col=I("red"),
xlim=c(-25,1000))
g3 <- qplot(var_train_num$House.Age,
geom="histogram",
binwidth = 5,
main = "Histogram for House.Age",
xlab = "House.Age",
fill=I("blue"),
col=I("red"),
xlim=c(-1,130))
g4 <- qplot(var_train_num$Number.of.bedrooms,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.bedrooms",
xlab = "Number.of.bedrooms",
fill=I("blue"),
col=I("red"),
xlim=c(-1,10))
g5 <- qplot(var_train_num$Number.of.Television,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Television",
xlab = "Number.of.Television",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
g6 <- qplot(var_train_num$Number.of.CD.VCD.DVD,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.CD.VCD.DVD",
xlab = "Number.of.CD.VCD.DVD",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
grid.arrange(g1,g2,g3,g4,g5,g6,nrow=2)
# Siguientes 6 variables
o1 <- qplot(var_train_num$Number.of.Component.Stereo.set,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Component.Stereo.set",
xlab = "Number.of.Component.Stereo.set",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
o2 <- qplot(var_train_num$Number.of.Refrigerator.Freezer,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Refrigerator.Freezer",
xlab = "Number.of.Refrigerator.Freezer",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
o3 <- qplot(var_train_num$Number.of.Washing.Machine,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Washing.Machine",
xlab = "Number.of.Washing.Machine",
fill=I("blue"),
col=I("red"),
xlim=c(-1,5))
o4 <- qplot(var_train_num$Number.of.Airconditioner,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Airconditioner",
xlab = "Number.of.Airconditioner",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
o5 <- qplot(var_train_num$Number.of.Car..Jeep..Van,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Car..Jeep..Van",
xlab = "Number.of.Car..Jeep..Van",
fill=I("blue"),
col=I("red"),
xlim=c(-1,6))
o6 <- qplot(var_train_num$Number.of.Landline.wireless.telephones,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Landline.wireless.telephones",
xlab = "Number.of.Landline.wireless.telephones",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
grid.arrange(o1,o2,o3,o4,o5,o6,nrow=2)
# Últimas 5 variables
w1 <- qplot(var_train_num$Number.of.Cellular.phone,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Cellular.phone",
xlab = "Number.of.Cellular.phone",
fill=I("blue"),
col=I("red"),
xlim=c(-1,12))
w2 <- qplot(var_train_num$Number.of.Personal.Computer,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Personal.Computer",
xlab = "Number.of.Personal.Computer",
fill=I("blue"),
col=I("red"),
xlim=c(-1,8))
w3 <- qplot(var_train_num$Number.of.Stove.with.Oven.Gas.Range,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Stove.with.Oven.Gas.Range",
xlab = "Number.of.Stove.with.Oven.Gas.Range",
fill=I("blue"),
col=I("red"),
xlim=c(-1,4))
w4 <- qplot(var_train_num$Number.of.Motorized.Banca,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Motorized.Banca",
xlab = "Number.of.Motorized.Banca",
fill=I("blue"),
col=I("red"),
xlim=c(-1,5))
w5 <- qplot(var_train_num$Number.of.Motorcycle.Tricycle,
geom="histogram",
binwidth = 1,
main = "Histogram for Number.of.Motorcycle.Tricycle",
xlab = "Number.of.Motorcycle.Tricycle",
fill=I("blue"),
col=I("red"),
xlim=c(-1,7))
grid.arrange(w1,w2,w3,w4,w5,nrow=2)
Puede observarse que la mayoría de las variables están sesgadas a la derecha, una característica común cuando tratamos con datos socioeconómicos. A simple vista, muy pocas tienen una distribución simétrica, como sea el caso de la distribución normal de la variable “House.Age”.
Como parte final del análisis exploratorio de datos, se muestran algunas visualizaciones interesantes sobre el tipo de población filipina en 2017. Viendo estas gráficas, podría afirmarse que se trata de una población mayormente agraria, en la que abundan los trabajos de campo. Además, las familias de la Región NAT son aquellas que más gastos tienen.
Para esta sección, han sido planteadas una serie de preguntas interesantes que pueden ser respondidas a través de visualizaciones:
¿Qué porcentaje de la población tiene unos ingresos inferiores a 400000$ anuales
La mayoría ronda por debajo de los 400000 pesos filipinos por familia al año (en torno al 85%). A continuación, se realiza una subdivisión del conjunto de datos en 3 subconjuntos: familias con ingresos inferiores a 400000 pesos filipinos, familias con ingresos entre 400000 y 1100000 pesos filipinos, y el resto. Mediante diagramas de caja (boxplot) se observa como esta división permite tener 3 grupos de familias en función de sus ingresos sin valores atípicos (gráfica 2 y 3), pese a observar mayor dispersión (desigualdad de ingresos) en el grupo más favorecido económicamente.
grupo1 <- datos_training %>%
filter(Total.Household.Income < 400000)
ceros_restaurant <- datos_training %>% filter(Restaurant.and.hotels.Expenditure < 1)
grupo2y3 <- datos_training %>%
filter(Total.Household.Income >= 400000)
grupo2 <- grupo2y3 %>%
filter(Total.Household.Income < 1000000)
grupo3 <- grupo2y3 %>%
filter(Total.Household.Income >= 1000000)
boxplot(datos_training$Total.Household.Income,
border = c("red"),
title="Diagrama de caja del conjunto de datos")
boxplot(grupo1$Total.Household.Income,
border = c("red"),
title="Diagrama de caja del grupo más desfavorecido económicamente")
boxplot(grupo2$Total.Household.Income,
border = c("blue"),
title="Diagrama de caja del grupo con ingresos 'medios'")
boxplot(grupo3$Total.Household.Income,
border = c("green"),
title="Diagrama de caja del grupo más favorecido económicamente")
Cuál es la relación entre el consumo de comida y ropa respecto a los ingresos familiares?
Se decide filtrar por regiones y visualizar la cantidad de ingresos por familia a partir del tamaño de las burbujas.
Bubble plot para ver la relación entre consumo de comida y de ropa respecto a los ingresos (tamaño de la burbuja) dividido por regiones (ciudades)
datos_training %>%
ggplot(aes(x=Clothing..Footwear.and.Other.Wear.Expenditure, y=Total.Food.Expenditure, size = Total.Household.Income, color = Region, scientific=F)) +
geom_point(alpha=0.5) +
scale_size(range = c(.1, 24), name="Ingresos")
¿Cuáles son las profesiones más comunes de una familia Filipina?
Como se indicaba al principio de esta sección, la mayoría es una población agraria.
# Profesiones más comúnes en Filipinas
by_common_jobs <- datos_occupation %>%na.omit()%>%
group_by(Household.Head.Occupation) %>%
summarise(Total = n()) %>%
arrange(desc(Total)) %>%
head(20) %>% ungroup()
ggplot(data = by_common_jobs) + geom_bar(mapping = aes(x = Household.Head.Occupation, y = Total), stat = "identity") + labs(title="Trabajos más comunes en familias filipinas") + theme(axis.text.x = element_text(angle = 30, hjust = 1))
¿Cuáles son las regiones de Filipinas con más gastos?
Los datos tienen un gran número de valores atípicos, por lo que para ver el boxplot es necesario aplicar un logaritmo en base 10. Vemos que las familias de las regiones CAR y NCR son aquellas que más gastos tienen
# Región y gastos
by_region_educ <- datos_occupation %>%
group_by(Region, Education.Expenditure, Housing.and.water.Expenditure) %>%
summarise(Total = n()) %>%
arrange(desc(Total)) %>% ungroup()
# Para ver el boxplot es necesario transformar la variable
ggplot(by_region_educ, aes(x=Region, y=Education.Expenditure)) + geom_boxplot(color="black", fill="orange", alpha = 0.6) + scale_y_log10() + labs(title="Gasto de educación por regiones") + theme(axis.text.x = element_text(angle = 30, hjust = 1))
Una vez hecho el análisis EDA, con un mejor conocimiento de los datos disponibles, es hora de empezar a prepararlos para diseñar el modelo.
El primer paso es un diagnóstico de valores faltantes, que tendremos que imputar con valores factibles.
Se recuerda que, a partir de ahora, se trabajará con el conjunto de datos train, ya que los datos test no serán utilizados hasta la última parte de este trabajo. Además, a la vista del apartado 4.4., se decide trabajar con el grupo 1, para evitar tener una segmentación de la población.
Cálculo del número total de NA en el conjunto de datos de train
datos_training <- grupo1
var_train_cat <- datos_training%>%select_if(is.factor)
var_train_num <- datos_training%>%select_if(is.numeric)
# ----- Detección e imputación de datos faltantes -----
# Cálculo del número total de NA en el conjunto de datos de train
length(which(is.na(datos_training)))
## [1] 972
Cálculo del número total de filas que contienen al menos un NA en el conjunto de datos de train
# Cálculo del número total de filas que contienen al menos un NA en el conjunto de datos de train
length(which(!complete.cases(datos_training)))
## [1] 969
Se observa que existen bastantes valores NA en el conjunto, pero todos corresponden a las variables cualitativas. Se muestra gráficamente como se distribuyenlos NA en el conjunto de datos correspondiente a las variables cualitativas.
Número de NA en el conjunto de variables cuantitativas
length(which(is.na(var_train_num)))
## [1] 0
Número de NA en el conjunto de variables cualitativas
length(which(is.na(var_train_cat)))
## [1] 972
# Número de NA en el conjunto de variables cuantitativas y en el conjunto de las cualitativas
# Visualización gráfica de la distribución de NA en el conjunto de datos correspondiente a las variables cualitativas
aggr_plot<-aggr(var_train_cat
,numbers=TRUE,sortVars=TRUE,
labels=names(var_train_cat)
,cex.axis=.7,gap=3
,ylab=c('Histograma de datos faltantes','Patrones de datos faltantes'),
only.miss=TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Household.Head.Class.of.Worker 0.1620573356
## Tenure.Status 0.0010118044
## Type.of.Walls 0.0006745363
## Type.of.Roof 0.0001686341
## Region 0.0000000000
## Main.Source.of.Income 0.0000000000
## Household.Head.Sex 0.0000000000
## Household.Head.Marital.Status 0.0000000000
## Household.Head.Highest.Grade.Completed 0.0000000000
## Household.Head.Job.or.Business.Indicator 0.0000000000
## Type.of.Household 0.0000000000
## Type.of.Building.House 0.0000000000
## Toilet.Facilities 0.0000000000
## Electricity 0.0000000000
## Main.Source.of.Water.Supply 0.0000000000
# Tabla de contingencias de las variables cuyos NA serán imputados
table_pre_Tenure<-prop.table(table(var_train_cat$Tenure.Status))
table_pre_Worker<-prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
table_pre_Walls<-prop.table(table(var_train_cat$Type.of.Walls))
table_pre_Roof<-prop.table(table(var_train_cat$Type.of.Roof))
# Summary de las 4 variables cuyos NA serán imputados
summary_Tenure <- summary(var_train_cat$Tenure.Status)
summary_Worker <- summary(var_train_cat$Household.Head.Class.of.Worker)
summary_Walls<-summary(var_train_cat$Type.of.Walls)
summary_Roof<-summary(var_train_cat$Type.of.Roof)
Al decidir qué método de imputación de datos faltantes utilizar, es conveniente tener en cuenta que se está trabajando tratando con variables categóricas, y que el modelo a diseñar será una regresión lineal múltiple.
Por ello, una buena opción es el método no lineal KNN (k nearest neighbors), el cual calcula la distancia del elemento nuevo a cada uno de los existentes, y ordena dichas distancias de menor a mayor para ir seleccionando el grupo al que pertenece. Por lo tanto, dicho grupo será aquel que tenga una menor distacia con la mayor frecuencia.
# Imputación de los valores NA usando el método no lineal kNN (k nearest neighbors)
var_train_cat <- VIM::kNN(var_train_cat,variable='Tenure.Status',impNA=TRUE)
var_train_cat$Tenure.Status_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Household.Head.Class.of.Worker',impNA=TRUE)
var_train_cat$Household.Head.Class.of.Worker_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Type.of.Walls',impNA=TRUE)
var_train_cat$Type.of.Walls_imp<-NULL
var_train_cat <- VIM::kNN(var_train_cat,variable='Type.of.Roof',impNA=TRUE)
var_train_cat$Type.of.Roof_imp<-NULL
# Comprobación de que se han eliminado todos los NA del conjunto de variables categóricas
length(which(is.na(var_train_cat)))
## [1] 0
# Calculamos las tablas de contingencia tras haber imputado los NA con kNN
table_pos_Tenure<-prop.table(table(var_train_cat$Tenure.Status))
table_pos_Worker<-prop.table(table(var_train_cat$Household.Head.Class.of.Worker))
table_pos_Walls<-prop.table(table(var_train_cat$Type.of.Walls))
table_pos_Roof<-prop.table(table(var_train_cat$Type.of.Roof))
Finalmente, se comprueba que las proporciones no se han visto afectadas por la imputación restando las proporciones antes y después de imputar y viendo que la variación es muy pequeña.
# Comprobación de que las proporciones no se han visto afectadas por la imputación
porc_dif_Tenure <- (table_pos_Tenure*100)-(table_pre_Tenure*100)
porc_dif_Worker <- (table_pos_Worker*100)-(table_pre_Worker*100)
porc_dif_Walls <- (table_pos_Walls*100)-(table_pre_Walls*100)
porc_dif_Roof <- (table_pos_Roof*100)-(table_pre_Roof*100)
porc_dif_Tenure
##
## Own house, rent lot
## -0.001178503
## Own house, rent-free lot with consent of owner
## -0.016242842
## Own house, rent-free lot without consent of owner
## -0.002288686
## Own or owner-like possession of house and lot
## 0.030248237
## Rent house/room including lot
## -0.004918968
## Rent-free house and lot with consent of owner
## -0.005277643
## Rent-free house and lot without consent of owner
## -0.000341595
porc_dif_Worker
##
## Worked without pay in own family-operated farm or business
## -0.130454688
## Employer in own family-operated farm or business
## -0.988672773
## Worked with pay in own family-operated farm or business
## -0.006522734
## Self-employed wihout any employee
## 1.982144269
## Worked for private household
## -0.316831132
## Worked for private establishment
## -1.112129605
## Worked for government/government corporation
## 0.572466663
porc_dif_Walls
##
## Salvaged Very Light Light Strong Quite Strong
## -0.0009219952 -0.0028798122 0.0014285235 0.0084402402 -0.0060669562
porc_dif_Roof
##
## Salvaged/makeshift materials
## -0.00009101518
## Light material (cogon,nipa,anahaw)
## 0.01448279059
## Mixed but predominantly salvaged materials
## -0.00003128647
## Mixed but predominantly light materials
## -0.00036690495
## Mixed but predominantly strong materials
## -0.00089877491
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## -0.01309480909
Variable porc_dif_Tenure
porc_dif_Tenure
##
## Own house, rent lot
## -0.001178503
## Own house, rent-free lot with consent of owner
## -0.016242842
## Own house, rent-free lot without consent of owner
## -0.002288686
## Own or owner-like possession of house and lot
## 0.030248237
## Rent house/room including lot
## -0.004918968
## Rent-free house and lot with consent of owner
## -0.005277643
## Rent-free house and lot without consent of owner
## -0.000341595
Variable porc_dif_Worker
porc_dif_Worker
##
## Worked without pay in own family-operated farm or business
## -0.130454688
## Employer in own family-operated farm or business
## -0.988672773
## Worked with pay in own family-operated farm or business
## -0.006522734
## Self-employed wihout any employee
## 1.982144269
## Worked for private household
## -0.316831132
## Worked for private establishment
## -1.112129605
## Worked for government/government corporation
## 0.572466663
Variable porc_dif_Walls
porc_dif_Walls
##
## Salvaged Very Light Light Strong Quite Strong
## -0.0009219952 -0.0028798122 0.0014285235 0.0084402402 -0.0060669562
Variable porc_dif_Roof
porc_dif_Roof
##
## Salvaged/makeshift materials
## -0.00009101518
## Light material (cogon,nipa,anahaw)
## 0.01448279059
## Mixed but predominantly salvaged materials
## -0.00003128647
## Mixed but predominantly light materials
## -0.00036690495
## Mixed but predominantly strong materials
## -0.00089877491
## Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)
## -0.01309480909
Para utilizar un modelo de regresión lineal múltiple, es muy conveniente que se cumplan las siquientes condiciones:
Por lo tanto, para poder aplicar un modelo de regresión multiple a las variables numéricas del presente trabajo, es necesario plantear una transformación para que se acerquen lo más posible a una distribución normal. Se recuerda que durante el análisis eda, se constató que la mayoría de las variables mostraban un sesgo a la derecha, lo cual podría estropear el diseño del modelo (distribuciones no normales).
A continuación, se procede a aplicar una transformación a las variables: el logaritmo decimal. Se irán mostrando los histogramas de cada variable tras realizar la transformación, de manera que sea posible apreciar aquellas distribuciones que se han conseguido acercar a la distribución normal.
NOTA: todos los valores iguales a 0, cuya transformación resulta en -Infinito, han sido imputados a un valor 0, para no entorpecer la visualización.
# Transformación logarítmica, que produce que los valores iguales a 0 se transformen a -Inf (por la definición del logaritmo)
var_train_num_Log<- log10(var_train_num)
# Se imputan los valores -Infinito a valor 0, para no entorpecer la visualización y el procesado
Log_sin_inf <- replace(var_train_num_Log,var_train_num_Log=="-Inf",0)
# Histogramas de las variables cuantitativas transformadas con el logaritmo
qplot(Log_sin_inf[1:7000,1],
geom="histogram",
main = "Histogram for Total Household Income",
xlab = "Total Household Income",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,2],
geom="histogram",
main = "Histogram for Total Food Expenditure",
xlab = "Total Food Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,3],
geom="histogram",
main = "Histogram for Bread.and.Cereals.Expenditure",
xlab = "Bread.and.Cereals.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,4],
geom="histogram",
main = "Histogram for Total Rice Expenditure",
xlab = "Total Rice Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,5],
geom="histogram",
main = "Histogram for Meat.Expenditure",
xlab = "Meat.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,6],
geom="histogram",
main = "Histogram for Total.Fish.and..marine.products.Expenditure",
xlab = "Total.Fish.and..marine.products.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,7],
geom="histogram",
main = "Histogram for Fruit.Expenditure",
xlab = "Fruit.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,8],
geom="histogram",
main = "Histogram for Vegetables.Expenditure",
xlab = "Vegetables.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,9],
geom="histogram",
main = "Histogram for Restaurant.and.hotels.Expenditure",
xlab = "Restaurant.and.hotels.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,10],
geom="histogram",
main = "Histogram for Alcoholic.Beverages.Expenditure",
xlab = "Alcoholic.Beverages.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,11],
geom="histogram",
main = "Histogram for Tobacco.Expenditure",
xlab = "Tobacco.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,12],
geom="histogram",
main = "Histogram for Clothing..Footwear.and.Other.Wear.Expenditure",
xlab = "Clothing..Footwear.and.Other.Wear.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,13],
geom="histogram",
main = "Histogram for Housing.and.water.Expenditure",
xlab = "Housing.and.water.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,14],
geom="histogram",
main = "Histogram for Imputed.House.Rental.Value",
xlab = "Imputed.House.Rental.Value",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,15],
geom="histogram",
main = "Histogram for Medical.Care.Expenditure",
xlab = "Medical.Care.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,16],
geom="histogram",
main = "Histogram for Transportation.Expenditure",
xlab = "Transportation.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,17],
geom="histogram",
main = "Histogram for Communication.Expenditure",
xlab = "Communication.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,18],
geom="histogram",
main = "Histogram for Education.Expenditure",
xlab = "Education.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,19],
geom="histogram",
main = "Histogram for Miscellaneous.Goods.and.Services.Expenditure",
xlab = "Miscellaneous.Goods.and.Services.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,20],
geom="histogram",
main = "Histogram for Special.Occasions.Expenditure",
xlab = "Special.Occasions.Expenditure",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,21],
geom="histogram",
main = "Histogram for Crop.Farming.and.Gardening.expenses",
xlab = "Crop.Farming.and.Gardening.expenses",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,22],
geom="histogram",
main = "Histogram for Total.Income.from.Entrepreneurial.Acitivites",
xlab = "Total.Income.from.Entrepreneurial.Acitivites",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,23],
geom="histogram",
main = "Histogram for Household.Head.Age",
xlab = "Household.Head.Age",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,24],
geom="histogram",
main = "Histogram for Total.Number.of.Family.members",
xlab = "Total.Number.of.Family.members",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,25],
geom="histogram",
main = "Histogram for Total.number.of.family.members.employed",
xlab = "Total.number.of.family.members.employed",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,26],
geom="histogram",
main = "Histogram for House.Floor.Area",
xlab = "House.Floor.Area",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,27],
geom="histogram",
main = "Histogram for House.Age",
xlab = "House.Age",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,28],
geom="histogram",
main = "Histogram for Number.of.bedrooms",
xlab = "Number.of.bedrooms",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,29],
geom="histogram",
main = "Histogram for Number.of.Television",
xlab = "Number.of.Television",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,30],
geom="histogram",
main = "Histogram for Number.of.CD.VCD.DVD",
xlab = "Number.of.CD.VCD.DVD",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,31],
geom="histogram",
main = "Histogram for Number.of.Component.Stereo.set",
xlab = "Number.of.Component.Stereo.set",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,32],
geom="histogram",
main = "Histogram for Number.of.Refrigerator.Freezer",
xlab = "Number.of.Refrigerator.Freezer",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,33],
geom="histogram",
main = "Histogram for Number.of.Washing.Machine",
xlab = "Number.of.Washing.Machine",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,34],
geom="histogram",
main = "Histogram for Number.of.Airconditioner",
xlab = "Number.of.Airconditioner",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,35],
geom="histogram",
main = "Histogram for Number.of.Car..Jeep..Van",
xlab = "Number.of.Car..Jeep..Van",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,36],
geom="histogram",
main = "Histogram for Number.of.Landline.wireless.telephones",
xlab = "Number.of.Landline.wireless.telephones",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,37],
geom="histogram",
main = "Histogram for Number.of.Cellular.phone",
xlab = "Number.of.Cellular.phone",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,38],
geom="histogram",
main = "Histogram for Number.of.Personal.Computer",
xlab = "Number.of.Personal.Computer",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,39],
geom="histogram",
main = "Histogram for Number.of.Stove.with.Oven.Gas.Range",
xlab = "Number.of.Stove.with.Oven.Gas.Range",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,40],
geom="histogram",
main = "Histogram for Number.of.Motorized.Banca",
xlab = "Number.of.Motorized.Banca",
fill=I("blue"),
col=I("red"))
qplot(Log_sin_inf[1:7000,41],
geom="histogram",
main = "Histogram for Number.of.Motorcycle.Tricycle",
xlab = "Number.of.Motorcycle.Tricycle",
fill=I("blue"),
col=I("red"))
Con esta transformación, se consiguen buenos resultados en general. Muchas variables adquieren distribuciones normales o casi normales, lo que permitirá poder utilizarlas en el diseño del modelo. Sin embargo, hay otras que supondrían un problema, pues tienen distribuciones muy asimétricas.
Algunas variables presentan un porcentaje muy alto de valores iguales a 0, lo que produce un polo en el extremo izquierdo de la distribución. Esto es debido a que la población filipina cuenta con un gran número de familias que viven en condiciones extremas de pobreza (aunque desde 2018, su situación económica está mejorando considerablemente).
En el siguiente apartado, se tendrá en cuenta lo analizado, para descartar aquellas variables cuya distribución no encaje con los requisitos, y partiendo del conjunto de datos transformados logarítmicamente.
Recordamos las condiciones óptimas de cualquier modelo de regresión lineal múltiple:
Por lo tanto, y a la vista del apartado anterior, se partirá del conjunto de variables transformadas logarítmicamente. Además, serán descartadas aquellas variables con distribuciones claramente asimétricas. Después, mediante un análisis de la correlación entre pares de las variables matrices, se rechazarán las variables altamente correladas entre sí a la hora de diseñar el modelo de regresión lineal múltiple.
# ----- Descarte de variables que no tienen distribuciones normales/simétricas -----
#-Imputed.House.Rental.Value,
#-Restaurant.and.hotels.Expenditure,
Log_reduced <- Log_sin_inf%>%select(
-Imputed.House.Rental.Value,
-Alcoholic.Beverages.Expenditure,
-Tobacco.Expenditure,
-Restaurant.and.hotels.Expenditure,
-Medical.Care.Expenditure,
-Communication.Expenditure,
-Education.Expenditure,
-Total.number.of.family.members.employed,
-Special.Occasions.Expenditure,
-Crop.Farming.and.Gardening.expenses,
-Total.Income.from.Entrepreneurial.Acitivites,
-Number.of.bedrooms,
-Number.of.Television,
-Number.of.CD.VCD.DVD,
-Number.of.Component.Stereo.set,
-Number.of.Refrigerator.Freezer,
-Number.of.Washing.Machine,
-Number.of.Airconditioner,
-Number.of.Car..Jeep..Van,
-Number.of.Landline.wireless.telephones,
-Number.of.Cellular.phone,
-Number.of.Personal.Computer,
-Number.of.Stove.with.Oven.Gas.Range,
-Number.of.Motorized.Banca,
-Number.of.Motorcycle.Tricycle)
# Cálculo de la matriz de correlaciones cruzadas
cor_matrix_log_reduced <- round(cor(Log_reduced),4)
Una vez seleccionadas solo aquellas variables que tienen distribuciones simétricas, se procede a mostrar un mapa de calor de la matriz de correlaciones cruzadas:
#----- Mapa de calor de la matriz de correlaciones cruzadas----------
mapa_corr <- melt(cor_matrix_log_reduced)
ggplot(data = mapa_corr, aes(x =X1, y =X2, fill =value)) + geom_tile() + theme(axis.text.x = element_text(angle = 60, vjust= 1, size = 6, hjust = 1)) + theme(axis.text.y = element_text( vjust= 1, size = 5, hjust = 1))
Es conveniente evitar variables altamente correlacionadas entre sí, descartando de cada par la que más correlada esté con todas las demás. En el análisis no se incluirá la variable a predecir (“Total.Household.Income”).
Se aplica un umbral de valor absoluto igual a 0.5, para filtar aquellas variables correladas entre sí por encima de él.
# ----- Selección de variables ----- #
# Subconjunto sin la variable "income" a predecir
sin_income_log <- Log_reduced[,c(2:length(Log_reduced))]
# Descarte de variables altamente correlacionadas (findCorrelation)
index_log<-findCorrelation(cor(sin_income_log),cutoff =.5,verbose = TRUE,exact = TRUE)
## Compare row 1 and column 11 with corr 0.687
## Means: 0.446 vs 0.271 so flagging column 1
## Compare row 11 and column 10 with corr 0.539
## Means: 0.347 vs 0.249 so flagging column 11
## Compare row 2 and column 5 with corr 0.535
## Means: 0.332 vs 0.23 so flagging column 2
## Compare row 4 and column 5 with corr 0.52
## Means: 0.321 vs 0.212 so flagging column 4
## Compare row 5 and column 7 with corr 0.683
## Means: 0.253 vs 0.191 so flagging column 5
## All correlations <= 0.5
sin_income_log <- sin_income_log%>%select(-index_log)
# Con el nuevo conjunto de variables, se calcula la matriz de correlación
new_var_train_log<-cbind(Total.Household.Income=Log_reduced[,1],sin_income_log)
cor_mat_log<-cor(new_var_train_log)
cor_mat_log<-cor_mat_log[,order(cor_mat_log[1,],decreasing = T)]
Mapa de calor tras descartar ciertas variables y ordenando el resto de variables seleccionadas frente al income por valor de correlación descendente (fijarse en la última fila de la matriz):
ggcorrplot(t(cor_mat_log), method = "circle") # Representación gráfica del mapa de calor
# Se escogerán las que tengan una correlación > de 0.5 con respecto al "Total.Household.Income"
Variables_ordenadas<-data.frame(t(cor_mat_log)[,'Total.Household.Income']) # Es para quedarse con la columna ordenada
colnames(Variables_ordenadas)<-'Coef. Corr'
View(Variables_ordenadas)
Se realiza la regresión lineal múltiple con las variables cuyo valor de correlación con el Income es superior a 0.4. Después, se calculan los residuos, los valores ajustados, y se visualizan:
# Se realiza la regresión lineal múltiple con las variables cuyo valor de correlación con el Income es superior a 0.4
RLM<-lm(Total.Household.Income~Housing.and.water.Expenditure
+Transportation.Expenditure
+Clothing..Footwear.and.Other.Wear.Expenditure
+Fruit.Expenditure
,data=new_var_train_log)
# Cálculo de residuos del modelo
residuos <-(residuals(RLM))
# Calculo de los valores ajustados con las observaciones de entrenamiento
valores.ajustados <- (fitted(RLM))
# Verificación de la no relación lineal entre valores predichos y residuos
plot(valores.ajustados, residuos,col="black")
# Valores de los betas estimados en la regresión lineal múltiple
summary(RLM)
##
## Call:
## lm(formula = Total.Household.Income ~ Housing.and.water.Expenditure +
## Transportation.Expenditure + Clothing..Footwear.and.Other.Wear.Expenditure +
## Fruit.Expenditure, data = new_var_train_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54742 -0.09141 -0.00512 0.08714 1.06368
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.488967 0.026375 94.37
## Housing.and.water.Expenditure 0.417984 0.006774 61.70
## Transportation.Expenditure 0.104066 0.003592 28.97
## Clothing..Footwear.and.Other.Wear.Expenditure 0.072035 0.003622 19.89
## Fruit.Expenditure 0.078366 0.005266 14.88
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Housing.and.water.Expenditure <0.0000000000000002 ***
## Transportation.Expenditure <0.0000000000000002 ***
## Clothing..Footwear.and.Other.Wear.Expenditure <0.0000000000000002 ***
## Fruit.Expenditure <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1437 on 5925 degrees of freedom
## Multiple R-squared: 0.665, Adjusted R-squared: 0.6647
## F-statistic: 2940 on 4 and 5925 DF, p-value: < 0.00000000000000022
El gráfico de dispersión no arroja buenos resultados. La información que de él se obtiene es que, para cada valor predicho, la varianza no es constante. Este gráfico también podría estar informando de que las variables seleccionadas para el modelo de regresión lineal no son quizá las más adecuadas.
Se obtiene un R-squared de 0.6747, lo cual parece indicar que con las variables elegidas se estaría explicando casi el 70% de la varianza de los ingresos (variable a predecir).
Las estimaciones de los betas tienen asociados un p-valor más pequeño que un nivel de significancia de 0.05, lo que indica que es muy poco probable que los betas estimados sean nulos.
El p-valor asociado al F-statistic es muy pequeño, menor que el nivel de significancia 0.05, por lo queel modelo es estadísticamente significativo y que las variables elegidas explican algo de la variable a predecir.
Para comprobar que el resultado es correcto, se observa en la próxima gráfica si los residuos siguen una distribución normal. Para ello, se representa el gráfico Q-Q que compara los cuantiles teóricos de una normal con los calculados. Cuantos más puntos caigan en la recta, mejor. Además, se realiza un test de normalidad con las siguientes hipótesis:
H0: los datos proceden de una distribución normal
H1: los datos no proceden de una distribución normal
qqnorm(residuos)
qqline(residuos)
ks.test(residuos,pnorm,mean=mean(residuos),sd=sd(residuos))
##
## One-sample Kolmogorov-Smirnov test
##
## data: residuos
## D = 0.023329, p-value = 0.003146
## alternative hypothesis: two-sided
residuos<-as.data.frame(residuos)
ggplot(residuos, aes(x=residuos)) +
geom_histogram(color='blue',fill='orange',aes(y=..density..), alpha=0.5,
position="identity")+
geom_density(color='red',alpha=.2)
residuos%>%summarise(media=mean(residuos),
max=max(residuos),
min=min(residuos),
sd=sd(residuos),
var=var(residuos))
Si se observan el histograma y el gráfico Q-Q, parece que los residuos siguen una distribución normal. Sin embargo, en el test de normalidad se obtiene un p-valor muy pequeño, cercano al nivel de significancia alpha = 0.05.
Utilizando el conjunto de test (30% de las observaciones) separado al principio, se evaluará el modelo, para ver sus prestaciones a la hora de calcular los ingresos.
Visualizamos los residuos obtenidos:
# Del conjunto de test, se seleccionan las variables adecuadas
pre_datos_testing_original <- datos_testing %>% filter(Total.Household.Income < 400000)
pre_datos_testing <- pre_datos_testing_original%>%select(Housing.and.water.Expenditure,
Transportation.Expenditure
,Clothing..Footwear.and.Other.Wear.Expenditure
,Fruit.Expenditure)
# Es necesario transformar logarítmicamente el conjunto de test antes de usarlo para validar, pues el conjunto de train estaba transformado logarítmicamente
pre_datos_testing<-log10(pre_datos_testing)
# Quitamos los valores -Inf transformandolos a 0
pre_datos_testing <- replace(pre_datos_testing,pre_datos_testing=="-Inf",0)
# Predicción con el modelo de RLM calculado
predichos_log <- predict(RLM,pre_datos_testing)
# Calculo de los residuos en el dominio transformado (log10)
valores_reales_log<-log10(pre_datos_testing_original$Total.Household.Income)
residuos_log<-valores_reales_log -predichos_log
Es conveniente representar los residuos frente a los valores predichos en el dominio transformado (log10)
# Es conveniente representar los residuos frente a los valores predichos en el dominio transformado (log10)
plot(predichos_log,residuos_log)
Se puede observar que el gráfico no varía mucho con respecto al mismo gráfico representado anteriormente (utilizando el conjunto de training):
df_comp_log10<-data.frame(predichos_log10=predichos_log,Valores_reales_log=valores_reales_log)
# Se obtienen, en un vector, los valores reales para compararlos con los predichos. Para ello, se calculan sus residuos
Valores_reales<-(pre_datos_testing_original$Total.Household.Income)
Valores_predichos<-10^(predichos_log)
# Calculamos los residuos
residuos<-Valores_reales-Valores_predichos
Son representados en un gráfico de dispersión los valores predichos por el modelo con los residuos en unidades normales (deshaciendo el logaritmo)
# Representamos en un gráfico de dispersión los valores predichos por el modelo con los residuos en unidades normales (deshaciendo el logaritmo)
plot(Valores_predichos,residuos)
Se muestra una comparativa entre los valores predichos y los reales:
#Comprobamos los valores predichos y originales en los dos dataFrames siguientes:
df_com_original<-data.frame(predichos=Valores_predichos,reales=Valores_reales,residuos=residuos)
#Originales
df_com_original
#Queremos ver cuanto nos equivocamos viendo los residuos
df_com_original%>%summarise(media=mean(residuos),
max=max(residuos),
min=min(residuos),
sd=sd(residuos),
var=var(residuos))
Se debe tener en cuenta que los valores de ingresos (predichos y reales) están en moneda filipina, y que la equivalencia a euros es: 1 peso filipino –> 0.018 euros.
A la vista del máximo y del mínimo del valor absoluto de los residuos, se puede comprobar que este modelo llega a equivocarse en un máximo de 313226.72 que son: 5638.08096 euros.
Para comprobar la normalidad de los residuos, se utiliza el gráfico Q-Q que compara los cuantiles teóricos de una normal con los calculados. Cuantos más puntos caigan en la recta, mejor.
Llegados hasta este punto, resulta interesante mostrar el gráfico Q-Q, histograma y test de normalidad para los residuos.
# Unidades logarítmicas
qqnorm(residuos_log)
qqline(residuos_log)
ks.test(residuos_log,pnorm,mean=mean(residuos_log),sd=sd(residuos_log))
##
## One-sample Kolmogorov-Smirnov test
##
## data: residuos_log
## D = 0.02115, p-value = 0.2031
## alternative hypothesis: two-sided
residuos_log<-as.data.frame(residuos_log)
ggplot(residuos_log, aes(x=residuos_log)) +
geom_histogram(color='blue',fill='orange',aes(y=..density..), alpha=0.5,
position="identity")+
geom_density(color='red',alpha=.2)
residuos_log%>%summarise(media=mean(residuos_log),
max=max(residuos_log),
min=min(residuos_log),
sd=sd(residuos_log),
var=var(residuos_log))
# Unidades normales
qqnorm(residuos)
qqline(residuos)
ks.test(residuos,pnorm,mean=mean(residuos),sd=sd(residuos))
##
## One-sample Kolmogorov-Smirnov test
##
## data: residuos
## D = 0.08753, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
residuos<-as.data.frame(residuos)
ggplot(residuos, aes(x=residuos)) +
geom_histogram(color='blue',fill='orange',aes(y=..density..), alpha=0.5,
position="identity")+
geom_density(color='red',alpha=.2)
residuos%>%summarise(media=mean(residuos),
max=max(residuos),
min=min(residuos),
sd=sd(residuos),
var=var(residuos))
A lo largo del análisis, han sido descubiertas cosas interesantes que han sido de gran ayuda a la hora de definir un modelo de regresión lineal adecuado a los datos:
La población filipina es una población generalmente pobre, y con muchas desigualdades. Esto conduce a encontrarse con muchos valores atípicos (familias más ricas) y problemas para definir un modelo integrando todos los elementos del data set.
Existen dos o más comportamientos diferentes dentro del data set. Esto significa que, para desarrollar un modelo lógico que sea capaz de predecir los ingresos de una familia, deberemos diferenciar entre la familia filipina común (equivalente al 85% y mayoritariamente pobre) y la familia filipina de clase alta (aquellas que suponen valores atípicos y no son representativas de la situación económica más extendida).
Al analizar una población con tanto nivel de pobreza, debemos entender que hay variables predictoras de los ingresos que en otras sociedades con menor desigualdad y menor escasez económica, si podrían tener significancia, pero en este tipo de datos no la tienen. Por ejemplo, variables como los gastos en ropa, en comunicación, o en bienes de lujo, cuentan con un alto porcentaje de ceros (0). El problema radica en entender que para este tipo de poblaciones, las cuales no gastan practicamente nada fuera de primera necesidad, existen otras variables más adecuadas para predecir los ingresos.
La población filipina debe tratarse como una población pobre, con variables predictoras del nivel de ingresos algo diferentes a otro tipo de población con menores desigualdades, o en todo caso más “lineales”: niveles de ingresos más repartidos entre clases (que exista clase baja, media-baja, media, media alta y alta).